Abstract

This paper is devoted to regret lower bounds in the classical model of the stochastic multi-armed bandit. A well-known result of Lai and Robbins, which was then extended by Burnetas and Katehakis, established the existence of a logarithmic bound for all consistent policies. We relax the notion of consistency, and exhibit a generalisation of the logarithmic bound. We also show the non-existence of a logarithmic bound in the general case of Hannan consistency. To get these results, we study variants of popular Upper Confidence Bounds (UCB) policies. As a by-product, we prove that it is impossible to design an adaptive policy that would select the best of two algorithms by taking advantage of the properties of the environment.

Keywords: stochastic bandits, regret bounds, selectivity, UCB policies. 



1. Introduction and notations 

Multi-armed bandits are a classical way to illustrate the difficulty of decision making in the case of a dilemma between exploration and exploitation. The name of these models comes from an analogy with playing a slot machine with more than one arm. Each arm has a given (and unknown) reward distribution and, for a given number of rounds, the agent has to choose one of them. As the goal is to maximize the sum of rewards, each round decision consists in a trade-off between exploitation (i.e. playing the arm that has been the most lucrative so far) and exploration (i.e. testing another arm, hoping to discover an alternative that beats the current best choice). One possible application is clinical trials, when one wants to heal as many patients as possible, when the latter arrive sequentially and when the effectiveness of each treatment is initially unknown (Thompson, 1933). Bandit problems were initially studied by Robbins (1952), and interest in them has since extended to many fields such as economics (Lamberton et al., 2004; Bergemann and Valimaki, 2008), games (Gelly and Wang, 2006), and optimisation (Kleinberg, 2005; Coquelin and Munos, 2007; Kleinberg et al., 2008; Bubeck et al., 2009).



Let us detail our model. A stochastic multi-armed bandit problem is defined by: 

• a number of rounds n, 

• a number of arms $K \ge 2$,

• an environment $\theta = (\nu_1, \dots, \nu_K)$, where each $\nu_k$ ($k \in \{1, \dots, K\}$) is a real-valued measure that represents the reward distribution of arm $k$.

We assume that rewards are bounded. Thus, for simplicity, each $\nu_k$ is a probability on $[0,1]$. Environment $\theta$ is initially unknown by the agent but lies in some known set $\Theta$ of the form $\Theta_1 \times \dots \times \Theta_K$, meaning that $\Theta_k$ is the set of possible reward distributions of arm $k$. For the problem to be interesting, the agent should not have great knowledge of its environment, so that $\Theta$ should neither be too small nor consist only of trivial distributions such as Dirac measures. To make it simple, each $\Theta_k$ is assumed to contain the distributions $p\delta_a + (1-p)\delta_b$, where $p, a, b \in [0,1]$ and $\delta_x$ denotes the Dirac measure centred on $x$. In particular, it contains Dirac and Bernoulli distributions. Note that the number of rounds $n$ may or may not be known by the agent, but this will not affect the present study. Some aspects of this particular point can be found in Salomon and Audibert (2011).
The game is as follows. At each round (or time step) $t = 1, \dots, n$, the agent has to choose an arm $I_t$ in the set of arms $\{1, \dots, K\}$. This decision is based on past actions and observations, and the agent may also randomize his choice. Once the decision is made, the agent gets and observes a payoff that is drawn from $\nu_{I_t}$ independently from the past. Thus we can describe a policy (or strategy) as a sequence $(\sigma_t)_{t \ge 1}$ (or $(\sigma_t)_{1 \le t \le n}$ if the number of rounds $n$ is known) such that each $\sigma_t$ is a mapping from the set $\{1, \dots, K\}^{t-1} \times [0,1]^{t-1}$ of past decisions and outcomes into the set of arms $\{1, \dots, K\}$ (or into the set of probabilities on $\{1, \dots, K\}$, in case the agent randomizes his choices).

For each arm $k$ and all times $t$, let $T_k(t) = \sum_{s=1}^{t} \mathbb{1}_{I_s = k}$ denote the number of times arm $k$ was pulled from round 1 to round $t$, and $X_{k,1}, X_{k,2}, \dots, X_{k,T_k(t)}$ the corresponding sequence of rewards. We denote by $\mathbb{P}_\theta$ the distribution on the probability space such that for any $k \in \{1, \dots, K\}$, the random variables $X_{k,1}, X_{k,2}, \dots, X_{k,n}$ are i.i.d. realizations of $\nu_k$, and such that these $K$ sequences of random variables are independent. Let $\mathbb{E}_\theta$ denote the associated expectation.

Let $\mu_k = \int x \, d\nu_k(x)$ be the mean reward of arm $k$. Introduce $\mu^* = \max_{k \in \{1,\dots,K\}} \mu_k$ and fix an arm $k^* \in \operatorname{argmax}_{k \in \{1,\dots,K\}} \mu_k$, that is, $k^*$ has the best expected reward. The agent aims at minimizing its regret, defined as the difference between the cumulative reward he would have obtained by always drawing the best arm and the cumulative reward he actually received. Its regret is thus

$$R_n = \sum_{t=1}^{n} X_{k^*,t} - \sum_{t=1}^{n} X_{I_t,\,T_{I_t}(t)}.$$

Like most publications on this topic, we focus on the expected regret, for which one can check that:

$$\mathbb{E}_\theta R_n = \sum_{k=1}^{K} \Delta_k \, \mathbb{E}_\theta[T_k(n)], \qquad (1)$$






where $\Delta_k$ is the optimality gap of arm $k$, defined by $\Delta_k = \mu^* - \mu_k$. We also define $\Delta$ as the gap between the best arm and the second best arm, i.e. $\Delta := \min_{k \ne k^*} \Delta_k$.
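As a quick illustration of decomposition (1), the following sketch computes the expected regret from the mean rewards and the expected numbers of pulls. The function name and the numerical values are illustrative assumptions and do not come from the paper.

```python
def expected_regret(means, expected_pulls):
    """Expected regret via (1): sum over arms of gap_k * E[T_k(n)]."""
    best = max(means)
    return sum((best - mu) * pulls for mu, pulls in zip(means, expected_pulls))


# Hypothetical three-armed example with n = 100 rounds; the gaps are 0, 0.2 and 0.5.
regret = expected_regret([0.9, 0.7, 0.4], [80.0, 15.0, 5.0])
```

Only the suboptimal arms contribute: the best arm has a zero gap, whatever its number of pulls.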



Previous works have shown the existence of lower bounds on the performance of a large class of policies. In this way Lai and Robbins (1985) proved a lower bound of the expected regret of order $\log n$ in a particular parametric framework, and they also exhibited optimal policies. This work has then been extended by Burnetas and Katehakis (1996). Both papers deal with consistent policies, meaning that all the policies considered are such that:

$$\forall a > 0,\ \forall \theta \in \Theta, \quad \mathbb{E}_\theta[R_n] = o(n^a). \qquad (2)$$



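The logarithmic bound of Burnetas and Katehakis is expressed as follows. For any environment $\theta = (\nu_1, \dots, \nu_K)$ and all $k \in \{1, \dots, K\}$, let us set

$$D_k(\theta) := \inf_{\tilde\nu_k \in \Theta_k \,:\, \mathbb{E}[\tilde\nu_k] > \mu^*} KL(\nu_k, \tilde\nu_k),$$

where $KL(\nu, \mu)$ denotes the Kullback-Leibler divergence of measures $\nu$ and $\mu$. Now fix a consistent policy and an environment $\theta \in \Theta$. If $k$ is a suboptimal arm (i.e. $\mu_k \ne \mu^*$) such that $0 < D_k(\theta) < +\infty$, then

$$\forall \varepsilon > 0, \quad \lim_{n \to +\infty} \mathbb{P}_\theta\left[ T_k(n) \ge \frac{(1-\varepsilon)\log n}{D_k(\theta)} \right] = 1.$$

This readily implies that:

$$\liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_k(n)]}{\log n} \ge \frac{1}{D_k(\theta)}.$$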

Thanks to Equation (1), it is then easy to deduce a lower bound on the expected regret.
One contribution of this paper is to extend this bound to a larger class of policies. We will define the notion of $\alpha$-consistency ($\alpha \in [0, 1]$) as a variant of Equation (2), where the equality $\mathbb{E}_\theta[R_n] = o(n^a)$ only holds for all $a > \alpha$. We show that the logarithmic bound still holds, but the coefficient $\frac{1}{D_k(\theta)}$ is turned into $\frac{1-\alpha}{D_k(\theta)}$. We also prove that the dependence of this new bound on the term $1 - \alpha$ is asymptotically optimal when $n \to +\infty$ (up to a constant).
As any policy achieves at most an expected regret of order $n$ (because the average cost of pulling an arm $k$ is a constant $\Delta_k$), it is also natural to wonder what happens when the expected regret is only required to be $o(n)$. This notion is equivalent to Hannan consistency. In this case, we show that there is no logarithmic bound any more.



Some of our results are obtained thanks to a study of particular Upper Confidence Bound algorithms. These policies were introduced by Lai and Robbins (1985): they basically consist in computing an index at each round and for each arm, and then in selecting the arm with the greatest index. A simple and efficient way to design such policies is to choose indexes that are upper bounds of the mean reward of the considered arm that hold with high probability (or, say, with high confidence level). This idea can be traced back to Agrawal (1995), and has been popularized by Auer et al. (2002), who notably described a policy called UCB1. For this policy, each index is defined by an arm $k$, a time step $t$, and an integer $s$ that indicates the number of times arm $k$ has been pulled before stage $t$. It is denoted by $B_{k,s,t}$ and is given by:

$$B_{k,s,t} = \hat X_{k,s} + \sqrt{\frac{2\log t}{s}},$$

where $\hat X_{k,s}$ is the empirical mean of arm $k$ after $s$ pulls, i.e. $\hat X_{k,s} = \frac{1}{s}\sum_{u=1}^{s} X_{k,u}$.
To summarize, the UCB1 policy first pulls each arm once and then, at each round $t > K$, selects an arm $k$ that maximizes $B_{k,T_k(t-1),t}$. Note that, by means of Hoeffding's inequality, the index $B_{k,T_k(t-1),t}$ is indeed an upper bound of $\mu_k$ with high probability (i.e. the probability is greater than $1 - 1/t^4$). Note also that a way to look at this index is to interpret the empirical mean $\hat X_{k,T_k(t-1)}$ as an "exploitation" term, and the square root as an "exploration" term (as it gradually increases when arm $k$ is not selected).
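The index and its confidence level can be sketched in a few lines. The function names and the parameter `explore_coef` (the constant in front of $\log t$, equal to 2 for UCB1) are assumptions made for this illustration; the failure bound is the one given by Hoeffding's inequality for rewards in $[0,1]$ and a fixed number of pulls.

```python
import math


def ucb_index(empirical_mean, pulls, t, explore_coef=2.0):
    """Index = empirical mean + sqrt(explore_coef * log(t) / pulls); UCB1 uses explore_coef = 2."""
    return empirical_mean + math.sqrt(explore_coef * math.log(t) / pulls)


def hoeffding_failure_bound(t, explore_coef=2.0):
    """Hoeffding bound on P(index < true mean): exp(-2 * explore_coef * log t) = t^(-2 * explore_coef)."""
    return t ** (-2.0 * explore_coef)
```

With `explore_coef = 2` the failure bound is $t^{-4}$, which matches the confidence level $1 - 1/t^4$ quoted above. For a fixed empirical mean, the index decreases with the number of pulls and increases with $t$.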

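The policy UCB1 achieves the logarithmic bound (up to a multiplicative constant), as it was shown that:

$$\forall \theta \in \Theta,\ \forall n \ge 3, \quad \mathbb{E}_\theta[T_k(n)] \le 12\,\frac{\log n}{\Delta_k^2} \quad \text{and} \quad \mathbb{E}_\theta R_n \le 12 \sum_{k:\,\Delta_k > 0} \frac{\log n}{\Delta_k} \le 12\,K\,\frac{\log n}{\Delta}.$$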



Audibert et al. (2009) studied some variants of the UCB1 policy. Among them, one consists in changing the $2\log t$ in the exploration term into $\rho\log t$, where $\rho > 0$. This can be interpreted as a way to tune exploration: the smaller $\rho$ is, the better the policy will perform in simple environments where information is disclosed easily (for example when all reward distributions are Dirac measures). On the contrary, $\rho$ has to be greater to face more challenging environments (typically when reward distributions are Bernoulli laws with close parameters).



This policy, which we denote UCB($\rho$), was proven by Audibert et al. to achieve the logarithmic bound when $\rho > 1$, and the optimality was also obtained when $\rho > \frac12$ for a variant of UCB($\rho$). Bubeck (2010) showed in his PhD dissertation that their ideas actually make it possible to prove optimality of UCB($\rho$) for $\rho > \frac12$. Moreover, the case $\rho = \frac12$ corresponds to a confidence level of $\frac1t$ (in view of Hoeffding's inequality, as above), and several studies (Lai and Robbins, 1985; Agrawal, 1995; Burnetas and Katehakis, 1996; Audibert et al., 2009; Honda and Takemura, 2010) have shown that this level is critical. We complete these works by a precise study of UCB($\rho$) when $\rho < \frac12$. We prove that UCB($\rho$) is $(1 - 2\rho)$-consistent and that it is not $\alpha$-consistent for any $\alpha < 1 - 2\rho$ (in view of the definition above, meaning that the expected regret is roughly of order $n^{1-2\rho}$). Not surprisingly, it performs well in simple settings, represented by an environment where all reward distributions are Dirac measures.
A by-product of this study is that it is not possible to design an algorithm that would specifically adapt to some kinds of environments, i.e. that would for example be able to select a proper policy depending on the environment being simple or challenging. In particular, and contrary to the results obtained within the class of consistent policies, there is no optimal policy. This contribution is linked with selectivity in on-line learning problems with perfect information, commonly addressed by prediction with expert advice such as algorithms with exponentially weighted forecasters (see, e.g., Cesa-Bianchi and Lugosi). In this spirit, a problem closely related to ours is that of regret against the best strategy from a pool, studied by Auer et al. (2003): the latter designed a policy in the context of adversarial/nonstochastic bandits whose decisions are based on a given number of recommendations (experts), which are themselves possibly the rewards received by a set of given algorithms. To a larger extent, model selection has been intensively studied in statistics, and is commonly solved by penalization methods (Mallows, 1973; Akaike, 1973; Schwarz, 1978).



Finally, we exhibit expected regret lower bounds of more general UCB policies, with the $2\log t$ in the exploration term of UCB1 replaced by an arbitrary function. We obtain Hannan consistent policies and, as mentioned before, lower bounds need not be logarithmic any more.

The paper is organized as follows: in Section 2 we give bounds on the expected regret of UCB($\rho$) ($\rho < \frac12$). In Section 3 we study the problem of selectivity. Then we focus in Section 4 on $\alpha$-consistent policies, and we conclude in Section 5 by results on Hannan consistency by means of extended UCB policies.

Throughout the paper $\lceil x \rceil$ denotes the smallest integer greater than or equal to the real $x$, and Ber($p$) denotes the Bernoulli law with parameter $p$.

2. Bounds on the expected regret of UCB($\rho$), $\rho < \frac12$

In this section we study the performance of the UCB($\rho$) policy, with $\rho$ lying in the interval $(0, \frac12)$. We recall that UCB($\rho$) is defined by:

• Draw each arm once,

• Then at each round $t$, draw an arm

$$I_t \in \operatorname*{argmax}_{k \in \{1,\dots,K\}} \left\{ \hat X_{k,T_k(t-1)} + \sqrt{\frac{\rho\log t}{T_k(t-1)}} \right\}.$$

Small values of $\rho$ can be interpreted as a low level of experimentation in the balance between exploration and exploitation, and the present literature has not yet provided precise regret bound orders for UCB($\rho$) with $\rho \in (0, \frac12)$.
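The two steps of the policy can be sketched as a short simulation. This is a minimal illustration under stated assumptions: rewards are Bernoulli, ties in the argmax go to the arm with the smallest index, and the function name, the `seed` argument and the numerical values are hypothetical choices, not part of the paper.

```python
import math
import random


def run_ucb_rho(probs, rho, n, seed=0):
    """Simulate UCB(rho) for n >= K rounds on Bernoulli arms with success probabilities `probs`.

    Returns the list of pull counts [T_1(n), ..., T_K(n)].
    """
    rng = random.Random(seed)
    k_arms = len(probs)
    pulls = [0] * k_arms
    sums = [0.0] * k_arms
    for t in range(1, n + 1):
        if t <= k_arms:
            arm = t - 1  # initialisation: draw each arm once
        else:
            indices = [
                sums[k] / pulls[k] + math.sqrt(rho * math.log(t) / pulls[k])
                for k in range(k_arms)
            ]
            arm = max(range(k_arms), key=indices.__getitem__)  # first maximiser on ties
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls


counts = run_ucb_rho([0.8, 0.3], rho=0.3, n=2000, seed=1)
```

Running the sketch with several values of `rho` and several seeds gives an empirical feel for how the number of pulls of the suboptimal arm depends on the exploration level.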

We first study the policy in simple environments (i.e. all reward distributions are Dirac measures), where the policy is supposed to perform well. We show that its expected regret is of order $\frac{\rho\log n}{\Delta}$ (Proposition 1 for the upper bound and Proposition 2 for the lower bound). These good performances are compensated by poor results in more complex environments, as we then prove that the overall expected regret lower bound is roughly of order $n^{1-2\rho}$ (Theorem 3).

Proposition 1 Let $0 \le b < a \le 1$ and $n \ge 1$. For $\theta = (\delta_a, \delta_b)$, the random variable $T_2(n)$ is uniformly upper bounded by $\frac{\rho}{\Delta^2}\log(n) + 1$. Consequently, the expected regret of UCB($\rho$) is upper bounded by $\frac{\rho}{\Delta}\log(n) + 1$.

Proof Let us prove the upper bound on the sampling time of the suboptimal arm by contradiction. The assertion is obviously true for $n = 1$ and $n = 2$. If the assertion is false, then there exists $t \ge 3$ such that $T_2(t) > \frac{\rho}{\Delta^2}\log(t) + 1$ and $T_2(t-1) \le \frac{\rho}{\Delta^2}\log(t-1) + 1$. Since $\log(t) \ge \log(t-1)$, this leads to $T_2(t) > T_2(t-1)$, meaning that arm 2 is drawn at time $t$. Therefore, we have $a + \sqrt{\frac{\rho\log t}{T_1(t-1)}} \le b + \sqrt{\frac{\rho\log t}{T_2(t-1)}}$, hence $\Delta \le \sqrt{\frac{\rho\log t}{T_2(t-1)}}$, which implies $T_2(t-1) \le \frac{\rho\log t}{\Delta^2}$ and thus $T_2(t) \le \frac{\rho\log t}{\Delta^2} + 1$. This contradicts the definition of $t$, which ends the proof of the first statement. The second statement is a direct consequence of Formula (1). ■
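Since rewards are deterministic in a Dirac environment, the bound of Proposition 1 can be checked by running the policy. The sketch below is only a numerical sanity check under stated assumptions: the values of `a`, `b` and `rho` are arbitrary, the function name is hypothetical, and ties are given to arm 1 (the proof does not depend on the tie-breaking rule).

```python
import math


def ucb_rho_dirac(a, b, rho, n):
    """Run UCB(rho) for n rounds on the two-armed environment (delta_a, delta_b).

    Rewards are deterministic, so the whole trajectory is deterministic.
    Returns the list [T_2(1), ..., T_2(n)] of pull counts of arm 2.
    """
    means = [a, b]
    pulls = [0, 0]
    history = []
    for t in range(1, n + 1):
        if t <= 2:
            arm = t - 1  # initialisation: draw each arm once
        else:
            index = [means[k] + math.sqrt(rho * math.log(t) / pulls[k]) for k in range(2)]
            arm = 0 if index[0] >= index[1] else 1  # ties go to arm 1 (arbitrary choice)
        pulls[arm] += 1
        history.append(pulls[1])
    return history


a, b, rho = 0.9, 0.6, 0.3
gap = a - b
history = ucb_rho_dirac(a, b, rho, 5000)
# Check T_2(t) <= rho * log(t) / gap^2 + 1 at every round t (small tolerance for rounding).
bound_holds = all(
    history[t - 1] <= rho * math.log(t) / gap**2 + 1 + 1e-9 for t in range(1, 5001)
)
```

The flag `bound_holds` records whether the uniform bound on $T_2(t)$ is satisfied along the whole trajectory.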

The following shows that Proposition 1 is tight and allows us to conclude that the expected regret of UCB($\rho$) is equivalent to $\frac{\rho}{\Delta}\log(n)$ when $n$ goes to infinity.



Proposition 2 Let $0 \le b < a \le 1$, $n \ge 2$ and $h : t \mapsto \frac{\rho}{\Delta^2}\log(t)\left(1 + \sqrt{\frac{2\rho\log t}{(t-1)\Delta^2}}\right)^{-2}$. For $\theta = (\delta_a, \delta_b)$, the random variable $T_2(n)$ is uniformly lower bounded by

$$f(n) = \int_2^n \min(h'(s), 1)\, ds - h(2).$$

As a consequence, the expected regret of UCB($\rho$) is lower bounded by $\Delta f(n)$.

Straightforward calculations show that $h'(s) \le 1$ for $s$ large enough, and this explains why our lower bound $\Delta f(n)$ is equivalent to $\Delta h(n) \sim \frac{\rho}{\Delta}\log(n)$ as $n$ goes to infinity.

Proof First, one can easily prove (for instance, by induction) that $T_2(t) \le t/2$ for any $t \ge 2$.
Let us prove the lower bound on $T_2(n)$ by contradiction. The assertion is obviously true for $n = 2$. If the assertion is false for $n \ge 3$, then there exists $t \ge 3$ such that $T_2(t) < f(t)$ and $T_2(t-1) \ge f(t-1)$. Since $f'(s) \in [0, 1]$ for any $s \ge 2$, we have $f(t) \le f(t-1) + 1$. These last three inequalities imply $T_2(t) < T_2(t-1) + 1$, which gives $T_2(t) = T_2(t-1)$. This means that arm 1 is drawn at time $t$. We consequently have

$$a + \sqrt{\frac{\rho\log t}{t - 1 - T_2(t-1)}} \ge b + \sqrt{\frac{\rho\log t}{T_2(t-1)}},$$

hence

$$\frac{\Delta}{\sqrt{\rho\log t}} \ge \frac{1}{\sqrt{T_2(t-1)}} - \frac{1}{\sqrt{t - 1 - T_2(t-1)}} \ge \frac{1}{\sqrt{T_2(t-1)}} - \frac{\sqrt{2}}{\sqrt{t-1}}.$$

We then deduce that $T_2(t) = T_2(t-1) \ge h(t) \ge f(t)$. This contradicts the definition of $t$, which ends the proof of the first statement. Again, the second statement results from Formula (1). ■

Now we show that the order of the lower bound of the expected regret is $n^{1-2\rho}$. Thus, for $\rho \in (0, \frac12)$, UCB($\rho$) does not perform enough exploration to achieve the logarithmic bound, as opposed to UCB($\rho$) with $\rho \in (\frac12, +\infty)$.
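Theorem 3 For any $\rho \in (0, \frac12)$, any $\theta \in \Theta$ and any $\beta \in (0, 1)$, one has

$$\mathbb{E}_\theta[R_n] \le \sum_{k:\,\Delta_k > 0} \left[ \frac{4\rho\log n}{\Delta_k} + 2\Delta_k\left(\frac{\log n}{\log(1/\beta)} + 1\right)\frac{n^{1-2\rho\beta}}{1 - 2\rho\beta} \right].$$

Moreover, for any $\varepsilon > 0$, there exists $\theta \in \Theta$ such that

$$\lim_{n \to +\infty} \frac{\mathbb{E}_\theta[R_n]}{n^{1-2\rho-\varepsilon}} = +\infty.$$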



Proof Let us first show the upper bound. The core of the proof is a peeling argument and makes use of Hoeffding's maximal inequality. The idea is originally taken from Audibert et al. (2009), and the following is an adaptation of the proof of an upper bound in the case $\rho > \frac12$ which can be found in S. Bubeck's PhD dissertation.
First, let us notice that the policy selects an arm $k$ such that $\Delta_k > 0$ at step $t$ only if at least one of the three following inequalities holds, where $B_{k,s,t} = \hat X_{k,s} + \sqrt{\rho\log t / s}$ denotes the index of UCB($\rho$):

$$B_{k^*,T_{k^*}(t-1),t} \le \mu^*, \qquad (3)$$

$$\hat X_{k,T_k(t-1)} \ge \mu_k + \sqrt{\frac{\rho\log t}{T_k(t-1)}}, \qquad (4)$$

$$T_k(t-1) < \frac{4\rho\log n}{\Delta_k^2}. \qquad (5)$$

Indeed, if none of these inequalities holds, then:

$$B_{k^*,T_{k^*}(t-1),t} > \mu^* = \mu_k + \Delta_k \ge \mu_k + 2\sqrt{\frac{\rho\log n}{T_k(t-1)}} > \hat X_{k,T_k(t-1)} + \sqrt{\frac{\rho\log t}{T_k(t-1)}} = B_{k,T_k(t-1),t}.$$

We denote respectively by $\xi_{1,t}$, $\xi_{2,t}$ and $\xi_{3,t}$ the events corresponding to Equations (3), (4) and (5).
We have:



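$$\begin{aligned}
\mathbb{E}_\theta[T_k(n)] = \mathbb{E}_\theta\left[\sum_{t=1}^{n} \mathbb{1}_{I_t = k}\right]
&\le \left\lceil \frac{4\rho\log n}{\Delta_k^2} \right\rceil + \mathbb{E}_\theta\left[\sum_{t=\lceil 4\rho\log n/\Delta_k^2 \rceil}^{n} \mathbb{1}_{\{I_t = k\} \setminus \xi_{3,t}}\right] \\
&\le \left\lceil \frac{4\rho\log n}{\Delta_k^2} \right\rceil + \mathbb{E}_\theta\left[\sum_{t=\lceil 4\rho\log n/\Delta_k^2 \rceil}^{n} \mathbb{1}_{\xi_{1,t} \cup \xi_{2,t}}\right] \\
&\le \left\lceil \frac{4\rho\log n}{\Delta_k^2} \right\rceil + \sum_{t=\lceil 4\rho\log n/\Delta_k^2 \rceil}^{n} \big(\mathbb{P}(\xi_{1,t}) + \mathbb{P}(\xi_{2,t})\big).
\end{aligned}$$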
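We now have to find a proper upper bound for $\mathbb{P}(\xi_{1,t})$ and $\mathbb{P}(\xi_{2,t})$. To this aim, we apply the peeling argument with a geometric grid over the time interval $[1, t]$:

$$\begin{aligned}
\mathbb{P}(\xi_{1,t}) &\le \mathbb{P}\left(\exists s \in \{1, \dots, t\},\ \hat X_{k^*,s} + \sqrt{\frac{\rho\log t}{s}} \le \mu^*\right) \\
&\le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \mathbb{P}\left(\exists s : \beta^{j+1} t < s \le \beta^{j} t,\ \sum_{l=1}^{s} (X_{k^*,l} - \mu^*) \le -\sqrt{\rho\, s \log t}\right) \\
&\le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \mathbb{P}\left(\exists s : \beta^{j+1} t < s \le \beta^{j} t,\ \sum_{l=1}^{s} (X_{k^*,l} - \mu^*) \le -\sqrt{\rho\, \beta^{j+1} t \log t}\right).
\end{aligned}$$

By means of the Hoeffding-Azuma inequality for martingales, we then have:

$$\mathbb{P}(\xi_{1,t}) \le \sum_{j=0}^{\frac{\log t}{\log(1/\beta)}} \exp\left(-\frac{2\left(\sqrt{\rho\,\beta^{j+1} t \log t}\right)^2}{\beta^{j} t}\right) = \left(\frac{\log t}{\log(1/\beta)} + 1\right)\frac{1}{t^{2\rho\beta}},$$

and, for the same reasons, this bound also holds for $\mathbb{P}(\xi_{2,t})$.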
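Combining the former inequalities, we get:

$$\begin{aligned}
\mathbb{E}_\theta[T_k(n)] &\le \frac{4\rho\log n}{\Delta_k^2} + 2\sum_{t=\lceil 4\rho\log n/\Delta_k^2 \rceil}^{n} \left(\frac{\log t}{\log(1/\beta)} + 1\right)\frac{1}{t^{2\rho\beta}} \qquad (6) \\
&\le \frac{4\rho\log n}{\Delta_k^2} + 2\left(\frac{\log n}{\log(1/\beta)} + 1\right)\sum_{t=1}^{n} \frac{1}{t^{2\rho\beta}} \\
&\le \frac{4\rho\log n}{\Delta_k^2} + 2\left(\frac{\log n}{\log(1/\beta)} + 1\right)\left(1 + \int_1^{n} \frac{dt}{t^{2\rho\beta}}\right) \\
&\le \frac{4\rho\log n}{\Delta_k^2} + 2\left(\frac{\log n}{\log(1/\beta)} + 1\right)\frac{n^{1-2\rho\beta}}{1 - 2\rho\beta}.
\end{aligned}$$

As usual, the bound on the expected regret then comes from Formula (1).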

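Now let us show the lower bound. The result is obtained by considering an environment $\theta$ of the form $\left(\mathrm{Ber}(\frac12), \delta_{\frac12 - \Delta}\right)$, where $\Delta > 0$ is such that $2\rho(1 + \sqrt{\Delta})^2 < 2\rho + \varepsilon$. We set $T_n := \left\lceil \frac{\rho\log n}{\Delta} \right\rceil$, and define the event $\xi_n$ by:

$$\xi_n = \left\{ \hat X_{1,T_n} < \frac12 - \left(1 + \frac{1}{\sqrt{\Delta}}\right)\Delta \right\}.$$

When event $\xi_n$ occurs, for any $t \in \{T_n, \dots, n\}$ one has

$$\hat X_{1,T_n} + \sqrt{\frac{\rho\log t}{T_n}} \le \hat X_{1,T_n} + \sqrt{\frac{\rho\log n}{T_n}} < \frac12 - \left(1 + \frac{1}{\sqrt{\Delta}}\right)\Delta + \sqrt{\Delta} \le \frac12 - \Delta,$$

so that arm 1 is chosen no more than $T_n$ times by the UCB($\rho$) policy. Thus:

$$\mathbb{E}_\theta[T_2(n)] \ge \mathbb{P}_\theta(\xi_n)(n - T_n).$$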

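We shall now find a lower bound of the probability of $\xi_n$ thanks to the Berry-Esseen inequality. We denote by $C$ the corresponding constant, and by $\Phi$ the c.d.f. of the standard normal distribution. For convenience, we also define the following quantities:

$$\sigma := \sqrt{\mathbb{E}\left[\left(X_{1,1} - \tfrac12\right)^2\right]} = \frac12, \qquad M_3 := \mathbb{E}\left[\left|X_{1,1} - \tfrac12\right|^3\right] = \frac18.$$

Using the fact that $\Phi(-x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}\,x}\,\beta(x)$ with $\beta(x) \to 1$ as $x \to +\infty$, we are then able to write:

$$\begin{aligned}
\mathbb{P}_\theta(\xi_n) &= \mathbb{P}_\theta\left(\frac{\hat X_{1,T_n} - \frac12}{\sigma}\sqrt{T_n} < -2\left(1 + \frac{1}{\sqrt{\Delta}}\right)\Delta\sqrt{T_n}\right) \\
&\ge \Phi\left(-2(\Delta + \sqrt{\Delta})\sqrt{T_n}\right) - \frac{C M_3}{\sigma^3\sqrt{T_n}} \\
&\ge \frac{\exp\left(-2(\Delta + \sqrt{\Delta})^2 T_n\right)}{2\sqrt{2\pi}(\Delta + \sqrt{\Delta})\sqrt{T_n}}\,\beta\left(2(\Delta + \sqrt{\Delta})\sqrt{T_n}\right) - \frac{C M_3}{\sigma^3\sqrt{T_n}} \\
&\ge \frac{\exp\left(-2\left(\frac{\rho\log n}{\Delta} + 1\right)(\Delta + \sqrt{\Delta})^2\right)}{2\sqrt{2\pi}(\Delta + \sqrt{\Delta})\sqrt{T_n}}\,\beta\left(2(\Delta + \sqrt{\Delta})\sqrt{T_n}\right) - \frac{C M_3}{\sigma^3\sqrt{T_n}} \\
&= \frac{n^{-2\rho(1 + \sqrt{\Delta})^2}\exp\left(-2(\Delta + \sqrt{\Delta})^2\right)}{2\sqrt{2\pi}(\Delta + \sqrt{\Delta})\sqrt{T_n}}\,\beta\left(2(\Delta + \sqrt{\Delta})\sqrt{T_n}\right) - \frac{C M_3}{\sigma^3\sqrt{T_n}}.
\end{aligned}$$

Previous calculations and Formula (1) give

$$\mathbb{E}_\theta[R_n] = \Delta\,\mathbb{E}_\theta[T_2(n)] \ge \Delta\,\mathbb{P}_\theta(\xi_n)(n - T_n),$$

and the former inequality easily leads to the conclusion of the theorem.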

3. Selectivity 

In this section, we address the problem of selectivity in multi-armed stochastic bandit models. By selectivity, we mean the ability to adapt to the environment as and when rewards are observed. More precisely, it refers to the existence of a procedure that would perform at least as well as the policy that is best suited to the current environment $\theta$ among a given set of two (or more) policies. Two main reasons motivate this study.
On the one hand this question was answered by Burnetas and Katehakis within the class of consistent policies. Let us recall the definition of consistent policies.
Definition 4 A policy is consistent if

$$\forall a > 0,\ \forall \theta \in \Theta, \quad \mathbb{E}_\theta[R_n] = o(n^a).$$

Indeed they show the existence of lower bounds on the expected regret (see Section 3, Theorem 1 of Burnetas and Katehakis (1996)), which we also recall for the sake of completeness.



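Theorem 5 Fix a consistent policy and $\theta \in \Theta$. If $k$ is a suboptimal arm (i.e. $\mu_k < \mu^*$) and if $0 < D_k(\theta) < +\infty$, then

$$\forall \varepsilon > 0, \quad \lim_{n \to +\infty} \mathbb{P}_\theta\left[ T_k(n) \ge \frac{(1-\varepsilon)\log n}{D_k(\theta)} \right] = 1.$$

Consequently

$$\liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_k(n)]}{\log n} \ge \frac{1}{D_k(\theta)}.$$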



Recall that the lower bound on the expected regret is then deduced from Formula (1). Burnetas and Katehakis then exhibit an asymptotically optimal policy, i.e. one which achieves the former lower bounds. The fact that a policy does as well as any other one obviously solves the problem of selectivity.

Nevertheless one can wonder what happens if we do not restrict our attention to consistent 
policies any more. Thus, one natural way to relax the notion of consistency is the following. 



Definition 6 A policy is $\alpha$-consistent if

$$\forall a > \alpha,\ \forall \theta \in \Theta, \quad \mathbb{E}_\theta[R_n] = o(n^a).$$

For example we showed in the former section that UCB($\rho$) is $(1 - 2\rho)$-consistent for any $\rho \in (0, \frac12)$. The class of $\alpha$-consistent policies will be studied in Section 4.
Moreover, as the expected regret of any policy is at most of order $n$, it seems simpler and relevant to only require it to be $o(n)$:

$$\forall \theta \in \Theta, \quad \mathbb{E}_\theta[R_n] = o(n),$$

which corresponds to the definition of Hannan consistency. The class of Hannan consistent policies includes consistent policies and $\alpha$-consistent policies for any $\alpha \in (0, 1)$. Some results on Hannan consistency will be provided in Section 5.



On the other hand, this problem has already been studied in the context of adversarial bandits by Auer et al. (2003). Their setting differs from ours not only because their bandits are nonstochastic, but also because their adaptive procedure takes into account only a given number of recommendations, whereas in our setting the adaptation is supposed to come from observing rewards of the chosen arms (only one per time step). Nevertheless, there are no restrictions about consistency in the adversarial context and one can wonder if an "exponentially weighted forecasters" procedure like Exp4 could be transposed to our context. The answer is negative, as stated in the following theorem.
Theorem 7 Let $\tilde A$ be a consistent policy and let $\rho$ be a real in $(0, 0.4)$. There is no policy which can beat both $\tilde A$ and UCB($\rho$), i.e.:

$$\forall A,\ \exists \theta \in \Theta, \quad \limsup_{n \to +\infty} \frac{\mathbb{E}_\theta[R_n(A)]}{\min\left(\mathbb{E}_\theta[R_n(\tilde A)],\ \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))]\right)} > 1.$$

Thus there is no optimal policy if we extend the notion of consistency. Precisely, as UCB($\rho$) is $(1 - 2\rho)$-consistent, we have shown that there is no optimal policy within the class of $\alpha$-consistent policies (which is included in the class of Hannan consistent policies), where $\alpha > 0.2$.

Moreover, ideas from selectivity in adversarial bandits cannot work in the present context. As we said, this impossibility may also come from the fact that we cannot observe at each step the decisions and rewards of more than one algorithm. Nevertheless, if we were able to observe a given set of policies from step to step, then it would be easy to beat them all: it is then sufficient to aggregate all the observations and simply pull the arm with the greatest empirical mean. The case where we only observe decisions (and not rewards) of a set of policies may be interesting, but is left outside of the scope of this paper.

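Proof Assume by contradiction that

$$\exists A,\ \forall \theta \in \Theta, \quad \limsup_{n \to +\infty} u_{n,\theta} \le 1,$$

where

$$u_{n,\theta} = \frac{\mathbb{E}_\theta[R_n(A)]}{\min\left(\mathbb{E}_\theta[R_n(\tilde A)],\ \mathbb{E}_\theta[R_n(\mathrm{UCB}(\rho))]\right)}.$$

One has

$$\mathbb{E}_\theta[R_n(A)] \le u_{n,\theta}\,\mathbb{E}_\theta[R_n(\tilde A)],$$

so that the fact that $\tilde A$ is a consistent policy implies that $A$ is also consistent. Consequently the lower bound of Burnetas and Katehakis has to hold. In particular, in environment $\theta = (\delta_0, \delta_\Delta)$ one has for any $\varepsilon > 0$ and with positive probability (provided that $n$ is large enough):

$$T_1(n) \ge \frac{(1-\varepsilon)\log n}{D_1(\theta)}.$$

Now, note that there is a simple upper bound of $D_1(\theta)$:

$$D_1(\theta) \le \inf_{p,a \in [0,1] \,:\, (1-p)a > \Delta} KL\big(\delta_0,\ p\delta_0 + (1-p)\delta_a\big) = \inf_{p,a \in [0,1] \,:\, (1-p)a > \Delta} \log\left(\frac{1}{p}\right) = \log\left(\frac{1}{1-\Delta}\right).$$

And on the other hand, one has by means of Proposition 1:

$$T_1(n) \le 1 + \frac{\rho\log n}{\Delta^2}.$$

Thus we have that, for any $\varepsilon > 0$ and if $n$ is large enough,

$$1 + \frac{\rho\log n}{\Delta^2} \ge \frac{(1-\varepsilon)\log n}{\log\left(\frac{1}{1-\Delta}\right)}.$$

Letting $\varepsilon$ go to zero and $n$ to infinity, we get:

$$\rho \ge \frac{\Delta^2}{\log\left(\frac{1}{1-\Delta}\right)}.$$

This means that $\rho$ has to be lower bounded by $\frac{\Delta^2}{\log\left(\frac{1}{1-\Delta}\right)}$, but this is greater than 0.4 if $\Delta = 0.75$, hence the contradiction.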


Note that the former proof gives us a simple alternative to Theorem 3 to show that UCB($\rho$) is not consistent if $\rho < 0.4$. Indeed if it were consistent, then in environment $\theta = (\delta_0, \delta_\Delta)$, $T_1(n)$ would also have to be greater than $\frac{(1-\varepsilon)\log n}{D_1(\theta)}$ and lower than $1 + \frac{\rho\log n}{\Delta^2}$, and the same contradiction would hold.
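The numerical claim used in the proof of Theorem 7 can be checked directly. The function name below is a hypothetical choice made for this illustration.

```python
import math


def critical_rho(delta):
    """Lower bound on rho obtained in the proof: delta^2 / log(1 / (1 - delta))."""
    return delta**2 / math.log(1.0 / (1.0 - delta))


threshold = critical_rho(0.75)  # 0.5625 / log(4), about 0.406, which exceeds 0.4
```

Any $\rho < 0.4$ therefore violates the inequality $\rho \ge \Delta^2 / \log\left(\frac{1}{1-\Delta}\right)$ at $\Delta = 0.75$.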



4. Bounds on $\alpha$-consistent policies

We now study $\alpha$-consistent policies. We first show that the main result of Burnetas and Katehakis (Theorem 5) can be extended in the following way.

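Theorem 8 Fix an $\alpha$-consistent policy and $\theta \in \Theta$. If $k$ is a suboptimal arm and if $0 < D_k(\theta) < +\infty$, then

$$\forall \varepsilon > 0, \quad \lim_{n \to +\infty} \mathbb{P}_\theta\left[ T_k(n) \ge (1-\varepsilon)\frac{(1-\alpha)\log n}{D_k(\theta)} \right] = 1.$$

Consequently

$$\liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_k(n)]}{\log n} \ge \frac{1-\alpha}{D_k(\theta)}.$$

Recall that, as opposed to Burnetas and Katehakis (1996), there is no optimal policy (i.e. a policy that would achieve the lower bounds in all environments $\theta$), as proven in the former section.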



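Proof We adapt Proposition 1 in Burnetas and Katehakis (1996) and its proof, which one may have a look at for further details. We fix $\varepsilon > 0$, and we want to show that:

$$\lim_{n \to +\infty} \mathbb{P}_\theta\left[ T_k(n) \ge (1-\varepsilon)(1-\alpha)\frac{\log n}{D_k(\theta)} \right] = 1.$$

Set $\delta > 0$ and $\delta' > \alpha$ such that $\frac{1-\delta'}{1+\delta} > (1-\varepsilon)(1-\alpha)$. By definition of $D_k(\theta)$, there exists $\tilde\nu_k$ such that $\mathbb{E}[\tilde\nu_k] > \mu^*$ and

$$D_k(\theta) < KL(\nu_k, \tilde\nu_k) < (1+\delta)D_k(\theta),$$

where we denote $\theta = (\nu_1, \dots, \nu_K)$ and $\tilde\theta = (\nu_1, \dots, \tilde\nu_k, \dots, \nu_K)$. (Footnote 1: in Burnetas and Katehakis (1996), $D_k(\theta)$ is denoted $K_a(\theta)$ and $KL(\nu_k, \tilde\nu_k)$ is denoted $I(\theta_a, \theta'_a)$. The equivalence between other notations is straightforward.)
Let us define $I^\delta = KL(\nu_k, \tilde\nu_k)$ and the sets

$$A_n^{\delta'} := \left\{ T_k(n) < \frac{1-\delta'}{I^\delta}\log n \right\}, \qquad C_n^{\delta''} := \left\{ \log L_{T_k(n)} \le (1-\delta'')\log n \right\},$$

where $\delta''$ is such that $\alpha < \delta'' < \delta'$ and $L_j$ is defined by $\log L_j = \sum_{i=1}^{j} \log\left(\frac{d\nu_k}{d\tilde\nu_k}(X_{k,i})\right)$.

We show that $\mathbb{P}_\theta(A_n^{\delta'}) = \mathbb{P}_\theta(A_n^{\delta'} \cap C_n^{\delta''}) + \mathbb{P}_\theta(A_n^{\delta'} \setminus C_n^{\delta''}) \to 0$ as $n \to +\infty$.

On the one hand, one has:

$$\begin{aligned}
\mathbb{P}_\theta(A_n^{\delta'} \cap C_n^{\delta''}) &\le e^{(1-\delta'')\log n}\,\mathbb{P}_{\tilde\theta}(A_n^{\delta'} \cap C_n^{\delta''}) \qquad (7) \\
&\le n^{1-\delta''}\,\mathbb{P}_{\tilde\theta}(A_n^{\delta'}) = n^{1-\delta''}\,\mathbb{P}_{\tilde\theta}\left(n - T_k(n) > n - \frac{1-\delta'}{I^\delta}\log n\right) \\
&\le \frac{n^{1-\delta''}\,\mathbb{E}_{\tilde\theta}[n - T_k(n)]}{n - \frac{1-\delta'}{I^\delta}\log n} \qquad (8) \\
&= \frac{n^{-\delta''}\sum_{l \ne k}\mathbb{E}_{\tilde\theta}[T_l(n)]}{1 - \frac{1-\delta'}{I^\delta}\frac{\log n}{n}} \longrightarrow 0,
\end{aligned}$$

where (7) is a consequence of the definition of $C_n^{\delta''}$, (8) comes from Markov's inequality, and where the final limit is a consequence of the $\alpha$-consistency.
On the other hand we set $b_n := \frac{1-\delta'}{I^\delta}\log n$, so that we have:

$$\begin{aligned}
\mathbb{P}_\theta(A_n^{\delta'} \setminus C_n^{\delta''}) &\le \mathbb{P}_\theta\left(\max_{j \le \lfloor b_n \rfloor} \log L_j > (1-\delta'')\log n\right) \\
&\le \mathbb{P}_\theta\left(\frac{1}{b_n}\max_{j \le \lfloor b_n \rfloor} \log L_j > I^\delta\,\frac{1-\delta''}{1-\delta'}\right).
\end{aligned}$$

This term then tends to zero, as a consequence of the law of large numbers.

Now that $\mathbb{P}_\theta(A_n^{\delta'})$ tends to zero, the conclusion comes from the following inequality:

$$\frac{1-\delta'}{I^\delta} > \frac{1-\delta'}{(1+\delta)D_k(\theta)} \ge \frac{(1-\varepsilon)(1-\alpha)}{D_k(\theta)}.$$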

The former lower bound is asymptotically optimal, as claimed in the following proposition.

Proposition 9 There exist $\theta \in \Theta$ and a constant $c > 0$ such that, for any $\alpha \in [0, 1)$, there exist an $\alpha$-consistent policy and $k \ne k^*$ such that:

$$\liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_k(n)]}{(1-\alpha)\log n} \le c.$$

Proof By means of Proposition 1 the following holds for UCB($\rho$) in any environment of the form $\theta = (\delta_a, \delta_b)$ with $a \ne b$:

$$\liminf_{n \to +\infty} \frac{\mathbb{E}_\theta[T_k(n)]}{\log n} \le \frac{\rho}{\Delta^2},$$

where $k \ne k^*$.

As UCB($\rho$) is $(1 - 2\rho)$-consistent (Theorem 3), we can conclude by setting $c = \frac{1}{2\Delta^2}$ and by choosing the policy UCB$\left(\frac{1-\alpha}{2}\right)$.


5. Hannan consistency and other exploration functions 

We now study the class of Hannan consistent policies. We first show the necessity of a logarithmic lower bound in some environments $\theta$, and then a study of extended UCB policies will prove that there does not exist a logarithmic bound on the whole set $\Theta$.

5.1 The necessity of a logarithmic regret in some environments 

A simple idea helps to understand the necessity of a logarithmic regret in some environments. Assume that the agent knows the number of rounds $n$, and that he balances exploration and exploitation in the following way: he first pulls each arm $s(n)$ times, and then selects the arm that has obtained the best empirical mean for the rest of the game. If we denote by $p_{s(n)}$ the probability that the best arm does not have the best empirical mean after the exploration phase (i.e. after the first $K s(n)$ rounds), then the expected regret is of the form

$$c_1\,(1 - p_{s(n)})\,s(n) + c_2\,p_{s(n)}\,n. \qquad (9)$$

Indeed if the agent manages to identify the best arm then he only suffers the pulls of suboptimal arms during the exploration phase, and that represents an expected regret of order $s(n)$. If not, the number of pulls of suboptimal arms is of order $n$, and so is the expected regret.

Now we can approximate $p_{s(n)}$, because it has the same order as the probability that the best arm gets an empirical mean lower than the second best mean reward, and because $\frac{\hat X_{k^*,s(n)} - \mu^*}{\sigma}\sqrt{s(n)}$ (where $\sigma^2$ is the variance of $X_{k^*,1}$) approximately has a standard normal distribution by the central limit theorem:

$$p_{s(n)} \approx \mathbb{P}_\theta\left(\hat X_{k^*,s(n)} \le \mu^* - \Delta\right) = \mathbb{P}_\theta\left(\frac{\hat X_{k^*,s(n)} - \mu^*}{\sigma}\sqrt{s(n)} \le -\frac{\Delta\sqrt{s(n)}}{\sigma}\right) \approx \frac{1}{\sqrt{2\pi}}\,\frac{\sigma}{\Delta\sqrt{s(n)}}\,\exp\left(-\frac{\Delta^2 s(n)}{2\sigma^2}\right).$$

Then it is clear why the expected regret has to be logarithmic: $s(n)$ has to be at least of order $\log n$ if we want the second term $p_{s(n)}\,n$ of Equation (9) to be sub-logarithmic, but then the first term $(1 - p_{s(n)})\,s(n)$ is at least of order $\log n$.
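The trade-off behind expression (9) can be explored numerically. The sketch below is an illustration under explicit assumptions: the constants `c1` and `c2` are set to 1, the misidentification probability is replaced by its Gaussian approximation, and the gap, the standard deviation and all function names are arbitrary choices made for the example.

```python
import math


def misidentification_prob(s, gap, sigma):
    """Gaussian (CLT) approximation of p_s: P(N(0, 1) <= -gap * sqrt(s) / sigma)."""
    return 0.5 * math.erfc(gap * math.sqrt(s) / (sigma * math.sqrt(2.0)))


def etc_regret(s, n, gap, sigma, c1=1.0, c2=1.0):
    """Approximate regret of the form (9): c1 * (1 - p_s) * s + c2 * p_s * n."""
    p = misidentification_prob(s, gap, sigma)
    return c1 * (1.0 - p) * s + c2 * p * n


def best_exploration_length(n, gap, sigma):
    """Exploration length s in {1, ..., n // 2} minimising the approximate regret."""
    return min(range(1, n // 2 + 1), key=lambda s: etc_regret(s, n, gap, sigma))


s_star = best_exploration_length(10**4, gap=0.2, sigma=0.5)
```

Too short an exploration phase makes the term $p_{s(n)}\,n$ dominate, while too long a phase makes $s(n)$ itself dominate; the minimiser lies strictly between the two extremes and grows with $n$.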

This idea can be generalized, and this gives the following proposition. 



Proposition 10 For any policy, there exists $\theta \in \Theta$ such that

$$\limsup_{n \to +\infty} \frac{\mathbb{E}_\theta R_n}{\log n} > 0.$$

This result can be seen as a consequence of the main result of Burnetas and Katehakis (Theorem 5): if we assume by contradiction that $\limsup_{n \to +\infty} \frac{\mathbb{E}_\theta R_n}{\log n} = 0$ for all $\theta$, the considered policy is therefore consistent, but then the logarithmic lower bounds have to hold. The reason why we wrote the proposition anyway is that our proof is based on the simple reasoning stated above and that it consequently holds beyond our model (see the following for details).



Proof The proposition results from the following property on $\Theta$: there exist two environments $\theta = (\nu_1, \dots, \nu_K)$ and $\tilde\theta = (\tilde\nu_1, \dots, \tilde\nu_K)$ and $k \in \{1, \dots, K\}$ such that

• $k$ has the best mean reward in environment $\theta$,

• $k$ is not the winning arm in environment $\tilde\theta$,

• $\nu_k = \tilde\nu_k$ and there exists $\eta \in (0, 1)$ such that

$$\prod_{\ell \ne k} \frac{d\tilde\nu_\ell}{d\nu_\ell}(X_{\ell,1}) \ge \eta \quad \mathbb{P}_\theta\text{-a.s.} \qquad (10)$$

The idea is the following: in case $\nu_k = \tilde\nu_k$ is likely to be the reward distribution of arm $k$, then arm $k$ has to be pulled often for the regret to be small if the environment is $\theta$, but not so much, as one has to explore to know if the environment is actually $\tilde\theta$ (and the third condition ensures that the distinction can be tough to make). The lower bound on exploration is of order $\log n$, as in the sketch in the beginning of the section.

The proof actually holds for any $\Theta$ that has the above-mentioned property (i.e. without the assumptions we made on $\Theta$, namely being of the form $\Theta_1 \times \dots \times \Theta_K$ and/or containing distributions of the form $p\delta_a + (1-p)\delta_b$). In our setting, the property is easy to check. Indeed the three conditions hold for any $k$ and any pair of environments $\theta = (\nu_1, \dots, \nu_K)$, $\tilde\theta = (\tilde\nu_1, \dots, \tilde\nu_K)$ such that each $\nu_\ell$ (resp. $\tilde\nu_\ell$) is a Bernoulli law with parameter $p_\ell$ (resp. $\tilde p_\ell$) and such that

• $\forall \ell \neq k$, $p_k > p_\ell$,

• $\exists \ell \neq k$, $\tilde p_k < \tilde p_\ell$,

• $p_k = \tilde p_k$ and $p_\ell, \tilde p_\ell \in (0, 1)$ for any $\ell \neq k$.
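It is then sufficient to set

$$\eta = \left(\min\left\{\frac{\tilde p_1}{p_1}, \dots, \frac{\tilde p_{k-1}}{p_{k-1}}, \frac{\tilde p_{k+1}}{p_{k+1}}, \dots, \frac{\tilde p_K}{p_K}, \frac{1-\tilde p_1}{1-p_1}, \dots, \frac{1-\tilde p_{k-1}}{1-p_{k-1}}, \frac{1-\tilde p_{k+1}}{1-p_{k+1}}, \dots, \frac{1-\tilde p_K}{1-p_K}\right\}\right)^{K-1},$$

as $\frac{d\tilde\nu_\ell}{d\nu_\ell}(X_{\ell,1})$ equals $\frac{\tilde p_\ell}{p_\ell}$ when $X_{\ell,1} = 1$ and $\frac{1-\tilde p_\ell}{1-p_\ell}$ when $X_{\ell,1} = 0$.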
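As an illustration, here is a small worked instance with values chosen only for this example (they do not come from the paper): $K = 2$, $k = 1$, $p = (1/2, 2/5)$ and $\tilde p = (1/2, 3/5)$. The three conditions hold, since $p_1 > p_2$, $\tilde p_1 < \tilde p_2$ and $p_1 = \tilde p_1$. With a single arm other than $k$, the likelihood ratio of arm 2 takes only two values:

```latex
\[
\frac{d\tilde\nu_2}{d\nu_2}(1) = \frac{\tilde p_2}{p_2} = \frac{3/5}{2/5} = \frac{3}{2},
\qquad
\frac{d\tilde\nu_2}{d\nu_2}(0) = \frac{1-\tilde p_2}{1-p_2} = \frac{2/5}{3/5} = \frac{2}{3},
\qquad
\eta = \min\left\{\frac{3}{2}, \frac{2}{3}\right\} = \frac{2}{3}.
\]
```

Exchanging the roles of the two laws gives the same two values, so the same $\eta = 2/3$ is obtained in this instance.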

We will now compute a lower bound of the expected regret in environment $\tilde\theta$. To this aim, we set

$$g(n) := \frac{2 E_\theta R_n}{\Delta}.$$

In the following, $\tilde\Delta_k$ denotes the optimality gap of arm $k$ in environment $\tilde\theta$. Moreover the switch from $\tilde\theta$ to $\theta$ will result from Inequality (10) and from the fact that the event $\left\{\sum_{\ell \neq k} T_\ell(n) \le g(n)\right\}$ is measurable with respect to $X_{\ell,1}, \dots, X_{\ell,\lfloor g(n) \rfloor}$ ($\ell \neq k$) and to $X_{k,1}, \dots, X_{k,n}$. That enables us to introduce the function $q$ such that

$$1_{\left\{\sum_{\ell \neq k} T_\ell(n) \le g(n)\right\}} = q\left((X_{\ell,s})_{\ell \neq k,\, s=1..\lfloor g(n) \rfloor},\ (X_{k,s})_{s=1..n}\right)$$

and to write:

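$$\begin{aligned}
E_{\tilde\theta} R_n &\ge \tilde\Delta_k E_{\tilde\theta}[T_k(n)] \ge \tilde\Delta_k (n - g(n))\, P_{\tilde\theta}\left(T_k(n) \ge n - g(n)\right) \\
&= \tilde\Delta_k (n - g(n))\, P_{\tilde\theta}\Big(\sum_{\ell \neq k} T_\ell(n) \le g(n)\Big) \\
&= \tilde\Delta_k (n - g(n)) \int q\left((x_{\ell,s})_{\ell \neq k,\, s=1..\lfloor g(n) \rfloor},\ (x_{k,s})_{s=1..n}\right) \prod_{\substack{\ell \neq k \\ s=1..\lfloor g(n) \rfloor}} d\tilde\nu_\ell(x_{\ell,s}) \prod_{s=1..n} d\tilde\nu_k(x_{k,s}) \\
&\ge \tilde\Delta_k (n - g(n))\, \eta^{\lfloor g(n) \rfloor} \int q\left((x_{\ell,s})_{\ell \neq k,\, s=1..\lfloor g(n) \rfloor},\ (x_{k,s})_{s=1..n}\right) \prod_{\substack{\ell \neq k \\ s=1..\lfloor g(n) \rfloor}} d\nu_\ell(x_{\ell,s}) \prod_{s=1..n} d\nu_k(x_{k,s}) \\
&\ge \tilde\Delta_k (n - g(n))\, \eta^{g(n)}\, P_\theta\Big(\sum_{\ell \neq k} T_\ell(n) \le g(n)\Big) \\
&= \tilde\Delta_k (n - g(n))\, \eta^{g(n)} \left(1 - P_\theta\Big(\sum_{\ell \neq k} T_\ell(n) > g(n)\Big)\right) \\
&\ge \tilde\Delta_k (n - g(n))\, \eta^{g(n)} \left(1 - \frac{E_\theta\left[\sum_{\ell \neq k} T_\ell(n)\right]}{g(n)}\right) \\
&\ge \tilde\Delta_k (n - g(n))\, \eta^{g(n)} \left(1 - \frac{E_\theta\left[\sum_{\ell \neq k} \Delta_\ell T_\ell(n)\right]}{\Delta\, g(n)}\right) \\
&\ge \tilde\Delta_k (n - g(n))\, \eta^{g(n)} \left(1 - \frac{E_\theta R_n}{\Delta\, g(n)}\right) = \tilde\Delta_k \frac{n - g(n)}{2}\, \eta^{g(n)},
\end{aligned}$$

where the very first inequality is a consequence of Formula (1).

We are now able to conclude. Indeed, if we assume that $\frac{E_\theta R_n}{\log n} \to 0$ as $n \to +\infty$, then one has $g(n) \le \min\left(\frac{n}{2}, -\frac{\log n}{2 \log \eta}\right)$ for $n$ large enough and:

$$E_{\tilde\theta} R_n \ge \tilde\Delta_k \frac{n - g(n)}{2}\, \eta^{g(n)} \ge \tilde\Delta_k \frac{n}{4}\, \eta^{-\frac{\log n}{2 \log \eta}} = \tilde\Delta_k \frac{\sqrt{n}}{4}.$$

In particular, we have $\frac{E_{\tilde\theta} R_n}{\log n} \to +\infty$ as $n \to +\infty$, hence the conclusion.

■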

To finish this section, note that a proof could have been written in the same way with a slightly different property of $\Theta$: there exist two environments $\theta = (\nu_1, \dots, \nu_K)$ and $\tilde\theta = (\tilde\nu_1, \dots, \tilde\nu_K)$ and $k \in \{1, \dots, K\}$ such that

• $k$ has the best mean reward in environment $\theta$,

• $k$ is not the winning arm in environment $\tilde\theta$,

• $\nu_\ell = \tilde\nu_\ell$ for all $\ell \neq k$ and there exists $\eta \in (0, 1)$ such that

$$\frac{d\nu_k}{d\tilde\nu_k}(X_{k,1}) \ge \eta \quad P_{\tilde\theta}\text{-a.s.}$$

The dilemma is then between exploring arm $k$ or pulling the best arm of environment $\tilde\theta$.

5.2 There is no logarithmic bound in general

We extend our study to more general UCB policies, and we will find that there does not exist a logarithmic lower bound on the expected regret in the case of Hannan consistency. With "ucb", we now refer to a UCB policy with indexes of the form:

$$B_{k,s,t} = \hat X_{k,s} + \sqrt{\frac{f_k(t)}{s}},$$

where the functions $f_k$ ($1 \le k \le K$) are increasing.

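To find conditions for Hannan consistency, let us first show the following upper bound.

Lemma 11 If arm $k$ does not have the best mean reward, then for any $\beta \in (0, 1)$ the following upper bound holds:

$$E_\theta[T_k(n)] \le u + \sum_{t=u+1}^{n} \left(1 + \frac{\log t}{\log(1/\beta)}\right) \left(e^{-2\beta f_k(t)} + e^{-2\beta f_{k^*}(t)}\right),$$

where $u = \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil$.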
Proof We adapt the arguments leading to Equation (6) in the proof of Theorem 3. We begin by noticing that, if arm $k$ is selected at round $t$, then at least one of the three following inequalities holds:

$$B_{k^*,T_{k^*}(t-1),t} \le \mu^*,$$

$$\hat X_{k,T_k(t-1)} \ge \mu_k + \sqrt{\frac{f_k(t)}{T_k(t-1)}},$$

$$T_k(t-1) < \frac{4 f_k(n)}{\Delta_k^2},$$

and the rest follows straightforwardly.



We are now able to give sufficient conditions on the $f_k$ for UCB to be Hannan consistent.

Proposition 12 If $f_k(n) = o(n)$ for all $k \in \{1, \dots, K\}$, and if there exist $\gamma > \frac{1}{2}$ and $N \ge 1$ such that $f_k(n) \ge \gamma \log\log n$ for all $k \in \{1, \dots, K\}$ and for any $n \ge N$, then UCB is Hannan consistent.



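Proof Fix an index $k$ of a suboptimal arm and choose $\beta \in (0, 1)$ such that $2\beta\gamma > 1$. By means of Lemma 11, one has for $n$ large enough:

$$E_\theta[T_k(n)] \le u + 2 \sum_{t=u+1}^{n} \left(1 + \frac{\log t}{\log(1/\beta)}\right) e^{-2\beta\gamma \log\log t},$$

where $u = \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil$.

Consequently, we have:

$$E_\theta[T_k(n)] \le u + 2 \sum_{t=2}^{n} \left(\frac{1}{(\log t)^{2\beta\gamma}} + \frac{1}{\log(1/\beta)}\, \frac{1}{(\log t)^{2\beta\gamma - 1}}\right). \qquad (11)$$

Sums of the form $\sum_{t=2}^{n} \frac{1}{(\log t)^c}$ with $c > 0$ are equivalent to $\frac{n}{(\log n)^c}$ as $n \to +\infty$. Indeed, on the one hand we have

$$\sum_{t=3}^{n} \frac{1}{(\log t)^c} \le \int_2^n \frac{dx}{(\log x)^c} \le \sum_{t=2}^{n} \frac{1}{(\log t)^c},$$

so that $\sum_{t=2}^{n} \frac{1}{(\log t)^c} \sim \int_2^n \frac{dx}{(\log x)^c}$. On the other hand, one can write

$$\int_2^n \frac{dx}{(\log x)^c} = \left[\frac{x}{(\log x)^c}\right]_2^n + c \int_2^n \frac{dx}{(\log x)^{c+1}}.$$

As both integrals are divergent we have $\int_2^n \frac{dx}{(\log x)^{c+1}} = o\left(\int_2^n \frac{dx}{(\log x)^c}\right)$, so that $\int_2^n \frac{dx}{(\log x)^c} \sim \frac{n}{(\log n)^c}$.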
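The equivalence between $\sum_{t=2}^{n} (\log t)^{-c}$ and $n/(\log n)^c$ can be observed numerically. The following is a minimal sketch with hypothetical function names. It only illustrates the asymptotic statement and is not part of the proof.

```python
import math

def log_power_sum(n: int, c: float) -> float:
    """Return sum_{t=2}^{n} 1 / (log t)^c."""
    return sum(1.0 / math.log(t) ** c for t in range(2, n + 1))

def ratio_to_equivalent(n: int, c: float) -> float:
    """Return the sum divided by its claimed equivalent n / (log n)^c."""
    return log_power_sum(n, c) / (n / math.log(n) ** c)

if __name__ == "__main__":
    # The ratio should decrease towards 1 as n grows (here with c = 1).
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, round(ratio_to_equivalent(n, 1.0), 4))
```

The ratio approaches 1 only slowly: by the integration by parts above, the relative correction is of order $c/\log n$.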
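Now, by means of Equation (11), there exists $C > 0$ such that

$$E_\theta[T_k(n)] \le \left\lceil \frac{4 f_k(n)}{\Delta_k^2} \right\rceil + \frac{C n}{(\log n)^{2\beta\gamma - 1}}.$$

Since $f_k(n) = o(n)$ and $2\beta\gamma - 1 > 0$, the right-hand side is $o(n)$, and this proves Hannan consistency.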



The fact that there is no logarithmic lower bound then comes from the following proposition (which is a straightforward adaptation of Proposition 1).

Proposition 13 Let $0 \le b < a \le 1$ and $n \ge 1$. For $\theta = (\delta_a, \delta_b)$, the random variable $T_2(n)$ is uniformly upper bounded by $\frac{f_2(n)}{\Delta^2} + 1$. Consequently, the expected regret of ucb is upper bounded by $\frac{f_2(n)}{\Delta} + 1$.

Then, if $f_1(n) = f_2(n) = \log\log n$, ucb is Hannan consistent and the expected regret is at most of order $\log\log n$ in all environments of the form $(\delta_a, \delta_b)$. Hence the conclusion on the non-existence of logarithmic lower bounds.
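The bound on $T_2(n)$ can be checked on a simulation. The sketch below runs a UCB policy with index $\hat X_{k,s} + \sqrt{f(t)/s}$ on two deterministic arms, with $\Delta = a - b$. Several choices are assumptions made for this illustration: each arm is pulled once during the first two rounds, ties go to arm 1, and $f(t) = \log\log\max(t, 3)$ so that $f$ stays defined and positive. The function names are hypothetical.

```python
import math

def f(t: int) -> float:
    """Exploration function log log t, clipped at t = 3 (an assumption)."""
    return math.log(math.log(max(t, 3)))

def run_ucb_deterministic(a: float, b: float, n: int) -> list:
    """Run UCB for n >= 2 rounds on the environment (delta_a, delta_b).

    Rewards are deterministic, so the empirical mean of each arm equals
    its mean reward. Returns the pull counts [T_1(n), T_2(n)].
    """
    means = [a, b]
    pulls = [1, 1]  # assumed initialisation: one pull per arm (rounds 1, 2)
    for t in range(3, n + 1):
        indexes = [means[k] + math.sqrt(f(t) / pulls[k]) for k in range(2)]
        chosen = 0 if indexes[0] >= indexes[1] else 1  # ties go to arm 1
        pulls[chosen] += 1
    return pulls

if __name__ == "__main__":
    a, b, n = 0.9, 0.6, 10_000
    delta = a - b
    pulls = run_ucb_deterministic(a, b, n)
    print("T_2(n) =", pulls[1], " bound =", f(n) / delta**2 + 1)
```

In this sketch the suboptimal arm is pulled only while $\sqrt{f(t)/T_2(t-1)} > \Delta$, which is the mechanism behind the bound $f_2(n)/\Delta^2 + 1$.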